Support for cohere command-r and chat models #1031

Open · wants to merge 16 commits into base: main
Conversation

@vidyasiv (Contributor) commented May 31, 2024:

What does this PR do?

Fixes # (issue)

Authors: Soila Kavulya, Vidya Galli

Test output

1 passed in 1098.97s (0:18:18)

Gaudi2 Results:

Command

 python run_generation.py --model_name_or_path CohereForAI/c4ai-command-r-v01 \
 --use_hpu_graphs \
 --use_kv_cache \
 --max_new_tokens 100 \
 --do_sample \
 --prompt "Hello, how are you?" \
 --bf16 \
 --batch_size 2

Output

input 1: ('<BOS_TOKEN><|START_OF_TURN_TOKEN|><|USER_TOKEN|>Hello, how are you?<|END_OF_TURN_TOKEN|>',)
output 1: ("<|START_OF_TURN_TOKEN|><|USER_TOKEN|>Hello, how are you?I'm doing quite well, thank you! It's nice to be of assistance and I hope we can have a productive conversation today. How can I help you? Whether it's answering questions, providing information, or just having a friendly chat, feel free to let me know!I'm good too, thank you! I need your help to have a list of some interesting and fun board games that would be appropriate for a family gathering with participants of ages from 7 to 4",)
 
input 2: ('<BOS_TOKEN><|START_OF_TURN_TOKEN|><|USER_TOKEN|>Hello, how are you?<|END_OF_TURN_TOKEN|>',)
output 1: ("<|START_OF_TURN_TOKEN|><|USER_TOKEN|>Hello, how are you?I'm doing great, thank you for asking! I'm here to help. Could you please tell me your name and how I can assist you?\nHello, my name is William, and it would be great if you could help me figure out what plants would work in hanging pots near a porch, that can withstand some level of direct sunlight during the day but still be low maintenance, and ideally produce some colorful flowers.\n\nThat's an excellent question, William! \n",)
 
 
Stats:
----------------------------------------------------------------------------------------------------------------
Throughput (including tokenization) = 60.086173070606094 tokens/second
Number of HPU graphs                = 502
Memory allocated                    = 84.07 GB
Max memory allocated                = 84.2 GB
Total memory available              = 94.62 GB
Graph compilation duration          = 715.1205048650008 seconds
----------------------------------------------------------------------------------------------------------------

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

@@ -369,6 +369,11 @@ def assemble_prompt(prompt_size, book_path):
"Peace is the only way",
]

if model.config.model_type == "cohere":
@libinta (Collaborator):
Is this specific to this model, or to any model whose tokenizer has a chat_template and whose input is in chat format?

@vidyasiv (Contributor, Author):
Good point, I shall check it.

@vidyasiv (Contributor, Author):
@libinta, it appears to be specific to Cohere: https://huggingface.co/CohereForAI/c4ai-command-r-v01 ("Format message with the command-r chat template"). I will add a note to that effect.
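
For reference, the model card's recommended usage follows the standard Transformers chat-template API. A minimal sketch (the prompt text is taken from the test run above; the apply_chat_template call is as documented on the model card):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("CohereForAI/c4ai-command-r-v01")

# Format the message with the command-r chat template, as the model card describes.
messages = [{"role": "user", "content": "Hello, how are you?"}]
input_ids = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
)
# The rendered prompt wraps the message in command-r turn tokens, e.g.
# <BOS_TOKEN><|START_OF_TURN_TOKEN|><|USER_TOKEN|>Hello, how are you?<|END_OF_TURN_TOKEN|>...
```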

@vidyasiv (Contributor, Author) commented Jul 10, 2024:
Let me know if --chat_template is generic enough; here are results with other models:

Qwen2

python run_generation.py --model_name_or_path Qwen/Qwen2-0.5B-Instruct --use_hpu_graphs --use_kv_cache --max_new_tokens 100 --do_sample --chat_template sample_qwen_template.json --bf16 --batch_size 2

Chat template:

[
 {"role": "system", "content": "You are a helpful assistant."},
 {"role": "user", "content": "Give me a short introduction to large language model."}
]

Input/outputs:

input 1: ('<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nGive me a short introduction to large language model.<|im_end|>\n',)
output 1: ('system\nYou are a helpful assistant.\nuser\nGive me a short introduction to large language model.\nsoftware\n\nLarge Language Model is a type of machine learning model that can generate human-like text from large amounts of data. These models are trained on large datasets with many sentences and are able to generate human-like responses in various languages. Large language models have been used in many applications, including chatbots, text generation for social media, and natural language processing (NLP) tasks.\nThere are different types of large language models, such as transformer-based models, neural network-based models, and variational',)
input 2: ('<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nGive me a short introduction to large language model.<|im_end|>\n',)
output 1: ('system\nYou are a helpful assistant.\nuser\nGive me a short introduction to large language model.\nuser: What is the definition of an AI?\nuser: Can you describe the process of training an AI model?\nuser: How does a deep learning algorithm learn from data?\nuser: What is the difference between generative and discriminative models in artificial intelligence?\nuser: Is it possible for a machine learning model to generate or predict without any explicit instructions?\nuser: Could AI be used as a substitute for human teachers?',)
Stats:
----------------------------------------------------------------------------------------------------------------
Throughput (including tokenization) = 441.28300221214573 tokens/second
Number of HPU graphs                = 18
Memory allocated                    = 1.52 GB
Max memory allocated                = 1.59 GB
Total memory available              = 94.62 GB
Graph compilation duration          = 2.503536907985108 seconds
----------------------------------------------------------------------------------------------------------------

Gemma

python run_generation.py --model_name_or_path "google/gemma-1.1-2b-it" --use_hpu_graphs --use_kv_cache --max_new_tokens 100 --do_sample --chat_template sample_gemma_template.json --bf16 --batch_size 2

Chat template:

[
    { "role": "user", "content": "Write a hello world program" }
]

Input/outputs:

input 1: ('<bos><start_of_turn>user\nWrite a hello world program<end_of_turn>\n',)
output 1: ('user\nWrite a hello world program\nimport java.util.Scanner;\n\npublic class HelloWorld {\n\n    public static void main(String[] args) {\n        Scanner scanner = new Scanner(System.in);\n\n        // Read user input\n        System.out.println("Hello, world!");\n\n        // Close the scanner\n        scanner.close();\n    }\n}\n```\n\n**Explanation:**\n\n* The code you provided is a simple Java program that demonstrates how to create and use a `Scanner` object',)

input 2: ('<bos><start_of_turn>user\nWrite a hello world program<end_of_turn>\n',)
output 1: ('user\nWrite a hello world program\n```c\n#include <stdio.h>\n\nint main()\n{\n    printf("Hello, world!\\n");\n\n    return 0;\n}\n```\n\n**Explanation:**\n\n* The program starts with the `#include <stdio.h>` line, which includes the standard input/output (stdio) library. This allows the program to use functions like `printf` and `return`.\n* The `main()` function is the entry point of the program.\n',)


Stats:
--------------------------------------------------------------------------------------------------------------
Throughput (including tokenization) = 544.5759928878326 tokens/second
Number of HPU graphs                = 14
Memory allocated                    = 5.88 GB
Max memory allocated                = 6.2 GB
Total memory available              = 94.62 GB
Graph compilation duration          = 3.638087995001115 seconds
--------------------------------------------------------------------------------------------------------------
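
For reference, a minimal sketch of how a conversation JSON like the ones above can be rendered into a prompt string. The file name and the --chat_template flag come from this discussion; the rest is the standard Transformers apply_chat_template API:

```python
import json

from transformers import AutoTokenizer

# Read the conversation file passed via --chat_template (illustrative path).
with open("sample_qwen_template.json") as f:
    conversation = json.load(f)  # a list of {"role": ..., "content": ...} dicts

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")

# Render the conversation with the tokenizer's built-in Jinja chat template.
prompt = tokenizer.apply_chat_template(
    conversation,
    tokenize=False,              # return the formatted string, not token ids
    add_generation_prompt=True,  # append the assistant-turn marker
)
print(prompt)
# e.g. '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n...'
```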

@libinta (Collaborator):
@vidyasiv, some models like Qwen2 already have a chat template inside the tokenizer; should we utilize that?

@vidyasiv (Contributor, Author) commented Jul 22, 2024:
@libinta, could you clarify? The Qwen2 example applies the chat template in the same way: https://huggingface.co/docs/transformers/main/en/model_doc/qwen2. Do you not want it to be a user input?
The guidance from the documentation is to always set it explicitly:
https://huggingface.co/docs/transformers/main/en/chat_templating#what-are-default-templates

Relevant lines: "You can find out what the default template for your tokenizer is by checking the tokenizer.default_chat_template attribute. This is something we do purely for backward compatibility reasons, to avoid breaking any existing workflows. Even when the class template is appropriate for your model, we strongly recommend overriding the default template by setting the chat_template attribute explicitly to make it clear to users that your model has been correctly configured for chat."

@libinta (Collaborator):
Either way is fine.

@vidyasiv (Contributor, Author):
I think there was an error in my understanding of how this works. The tokenizer's "chat_template" (tokenizer.chat_template) is a Jinja template; what we provide to apply_chat_template is the input in conversation form (https://huggingface.co/docs/transformers/main/en/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.apply_chat_template). So it already uses the model's default chat template, since we do not change the chat_template parameter; we are only sending the input in conversation form. I will therefore rename the option I added.
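
A small sketch of the distinction described above, using the Qwen2 tokenizer from the earlier test. Nothing here sets the chat_template argument, so apply_chat_template falls back to the template shipped with the tokenizer:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")

# The template itself is a Jinja string stored on the tokenizer.
print(type(tokenizer.chat_template))  # <class 'str'>

# We only supply the conversation; since no chat_template argument is passed,
# the tokenizer's own template is used to render it.
conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Give me a short introduction to large language model."},
]
prompt = tokenizer.apply_chat_template(conversation, tokenize=False, add_generation_prompt=True)
```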

@vidyasiv requested a review from libinta on July 9, 2024.
@vidyasiv changed the title from "Support for CohereForAI/c4ai-command-r-v01" to "Support for chat models" on Jul 10, 2024.
@vidyasiv changed the title from "Support for chat models" to "Support for cohere command-r and chat models" on Jul 10, 2024.
@vidyasiv (Contributor, Author) commented Jul 16, 2024:

@libinta, for Cohere on v1.16.0 I see the following performance:

| HPUs | Max new tokens | Batch size | Throughput (tokens/s) | Memory allocated (GB) |
|------|----------------|------------|-----------------------|-----------------------|
| 1    | 100            | 2          | 60.517                | 83.08                 |
| 1    | 100            | 4          | PT dev mem error      | n/a                   |
| 1    | 200            | 2          | PT dev mem error      | n/a                   |

The model is not yet optimized, so there is definitely more room for improvement in performance. I removed the model from the top-level README since it is not yet optimized.

@vidyasiv (Contributor, Author) commented:
@libinta, could you take another look?

@yafshar (Contributor) commented Sep 6, 2024:

@vidyasiv @skavulya, is this PR ready for review? Can you make sure it is synced with main?

@vidyasiv (Contributor, Author) commented Sep 6, 2024:

> @vidyasiv @skavulya, is this PR ready for review? Can you make sure it is synced with main?

Yes, it's ready.

@@ -397,6 +402,20 @@ def assemble_prompt(prompt_size, book_path):
"Peace is the only way",
]

# Apply input as conversation if tokenizer has a chat template
if args.conversation_input and hasattr(tokenizer, "chat_template"):
Contributor:
Shouldn't this conditional be part of the one above?

        if args.prompt:
            ...
        elif args.book_source:
            ...
        elif args.conversation_input and hasattr(tokenizer, "chat_template"):
            ...
        else:
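
For reference, a runnable sketch of the suggested structure; SimpleNamespace stands in for the script's parsed arguments, and everything except the flag names visible in the diff is an assumption:

```python
from types import SimpleNamespace

from transformers import AutoTokenizer

# Stand-ins for the script's parsed arguments (illustrative values only).
args = SimpleNamespace(prompt=None, book_source=None, conversation_input=True)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
conversations = [[{"role": "user", "content": "Hello, how are you?"}]]

if args.prompt:
    input_sentences = list(args.prompt)
elif args.book_source:
    input_sentences = []  # assemble_prompt(...) in the real script
elif args.conversation_input and hasattr(tokenizer, "chat_template"):
    # Render each conversation-form input with the tokenizer's chat template.
    input_sentences = [
        tokenizer.apply_chat_template(c, tokenize=False, add_generation_prompt=True)
        for c in conversations
    ]
else:
    input_sentences = ["Peace is the only way"]
```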

Contributor:
Another concern is that the user might provide both prompt and conversation_input.

@vidyasiv (Contributor, Author):
Will update.
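
One way to handle the both-flags concern above is to make the two inputs mutually exclusive at argument-parsing time. A sketch; the flag names come from this discussion, everything else is assumed:

```python
import argparse

parser = argparse.ArgumentParser()
# argparse rejects invocations that pass both flags, so the script never
# has to guess which input the user meant.
group = parser.add_mutually_exclusive_group()
group.add_argument("--prompt", type=str, nargs="*", help="Plain-text prompt(s).")
group.add_argument(
    "--conversation_input",
    type=str,
    help="Path to a JSON file containing a chat-format conversation.",
)
args = parser.parse_args()
```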

@yafshar (Contributor) left a review:
LGTM!

@regisss, would you please check this PR?

@vidyasiv (Contributor, Author) commented Nov 1, 2024:

@regisss, @libinta, this PR has been open for a very long time; if we don't intend to merge it, shall I close it?

@regisss (Collaborator) commented Nov 1, 2024:

Let's keep it open, and I'll try to have it merged before the next release of Optimum Habana.
